Abstract

The tagger described here is based on a hidden Markov
model and uses tags composed of features such as part
of speech, gender, etc. The contextual probability of a
tag (state transition probability) is deduced from the
contextual probabilities of its feature-value pairs.

This approach is advantageous when the available 
training corpus is small and the tag set large, which 
can be the case with morphologically rich languages. 
The present article describes a probabilistic tagger
that is based on a hidden Markov model (HMM) (Rabiner,
1990) and employs tags which are feature structures.
Their features concern part of speech (POS), gender,
number, etc. and have only atomic values.

Usually, the contextual probability of a tag (state
transition probability) is estimated by dividing a trigram
frequency by a bigram frequency (second-order HMM).
With a large tag set, resulting from the fact that the
tags contain a lot of morphological information besides
the POS, and with only a small training corpus
available, most of these frequencies are too low for an
exact estimation of contextual probabilities.

Our feature structure tagger estimates these
probabilities by combining the contextual probabilities
of the single feature-value pairs (fv-pairs) of the tags
(cf. sec. 2).

The starting point for the implementation of the feature
structure tagger was a second-order HMM tagger
(trigrams) based on a modified version of the Viterbi
algorithm (Viterbi, 1967; Church, 1988), which we had
earlier implemented in C (Kempe, 1994). There we
modified the calculation of the contextual probabilities
of the tags in the way described above (cf. sec. 4).

A test of both taggers under the same conditions on
a French corpus^1 has shown that the feature structure
tagger is clearly better when the available training
corpus is small and the tag set is large but the tags are
decomposable into relatively few fv-pairs. The latter
can be the case with morphologically rich languages
when the tags contain a lot of morphological
information (cf. sec. 5).



^1 I am much obliged to Achim Stein and Leo Wanner,
Romance Dept., Univ. Stuttgart, Germany, for providing the
corpus and a dictionary.
In order to assign tags to a word sequence, an HMM can
be used where the tagger selects among all possible
tag sequences the most probable one (Garside, Leech
and Sampson, 1987; Church, 1988; Brown et al., 1989;
Rabiner, 1990). The joint probability of a tag sequence
$t = t_0 \dots t_{N-1}$ and a word sequence
$w = w_0 \dots w_{N-1}$ is, in the case of a second-order HMM:

$$p(t, w) = \pi_{t_0 t_1} \cdot p(w_0 \mid t_0) \cdot p(w_1 \mid t_1) \cdot \prod_{i=2}^{N-1} \left[ p(w_i \mid t_i) \cdot p(t_i \mid t_{i-2}\, t_{i-1}) \right] \tag{1}$$



The term $\pi_{t_0 t_1}$ stands for the initial state
probability, i.e. the probability that the sequence begins with
the first two tags. $N$ is the number of words in the
sequence, i.e. the corpus size. The term $p(w_i \mid t_i)$ is the
probability of a word $w_i$ in the context of the assigned
tag $t_i$. It is called observation symbol probability
(lexical probability) and can be estimated by:

$$p(w_i \mid t_i) = \frac{f(w_i\, t_i)}{f(t_i)} \tag{2}$$
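Formula (1) can be evaluated directly once the three kinds of probabilities are available. The following is a minimal sketch, assuming the probabilities are stored in plain dictionaries; the function name, the dictionary layout and the toy values are illustrative assumptions, not part of the tagger described here. Logarithms are summed instead of multiplying probabilities, which is a common way to avoid numerical underflow on long sequences.

```python
from math import exp, log


def joint_log_probability(words, tags, initial, lexical, transition):
    """Return log p(t, w) as in formula (1); needs at least two words."""
    log_p = log(initial[(tags[0], tags[1])])
    log_p += log(lexical[(words[0], tags[0])])
    log_p += log(lexical[(words[1], tags[1])])
    for i in range(2, len(words)):
        log_p += log(lexical[(words[i], tags[i])])
        log_p += log(transition[(tags[i - 2], tags[i - 1], tags[i])])
    return log_p


# Toy probabilities, invented for illustration only.
initial = {("DET", "ADJ"): 0.5}
lexical = {("la", "DET"): 0.5, ("petite", "ADJ"): 0.25, ("maison", "NOUN"): 0.5}
transition = {("DET", "ADJ", "NOUN"): 0.8}

log_p = joint_log_probability(
    ["la", "petite", "maison"], ["DET", "ADJ", "NOUN"],
    initial, lexical, transition,
)
print(exp(log_p))  # approximately 0.5 * 0.5 * 0.25 * 0.5 * 0.8 = 0.025
```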



The second-order state transition probability
(contextual probability) $p(t_i \mid t_{i-2}\, t_{i-1})$ in formula (1)
expresses how probable it is that the tag $t_i$ appears in
the context of its two preceding tags $t_{i-2}$ and $t_{i-1}$. It
is usually estimated as the ratio of the frequency of
the trigram $(t_{i-2}, t_{i-1}, t_i)$ in a given training corpus
to the frequency of the bigram $(t_{i-2}, t_{i-1})$ in the same
corpus:

$$p(t_i \mid t_{i-2}\, t_{i-1}) = \frac{f(t_{i-2}\, t_{i-1}\, t_i)}{f(t_{i-2}\, t_{i-1})} \tag{3}$$
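The relative-frequency estimates of formulas (2) and (3) can be sketched in a few lines. This is only an illustration under stated assumptions: the corpus is taken to be a list of (word, tag) pairs, and the function name and the six-word toy corpus are hypothetical.

```python
from collections import Counter


def estimate_probabilities(tagged_corpus):
    """Relative-frequency estimates for formulas (2) and (3).

    tagged_corpus: list of (word, tag) pairs in corpus order.
    Returns lexical[(word, tag)] and transition[(t2, t1, t0)].
    """
    tags = [tag for _, tag in tagged_corpus]
    tag_freq = Counter(tags)
    word_tag_freq = Counter(tagged_corpus)
    bigram_freq = Counter(zip(tags, tags[1:]))
    trigram_freq = Counter(zip(tags, tags[1:], tags[2:]))

    lexical = {(w, t): f / tag_freq[t] for (w, t), f in word_tag_freq.items()}
    transition = {
        (t2, t1, t0): f / bigram_freq[(t2, t1)]
        for (t2, t1, t0), f in trigram_freq.items()
    }
    return lexical, transition


# Hypothetical toy corpus, for illustration only.
corpus = [
    ("la", "DET"), ("petite", "ADJ"), ("maison", "NOUN"),
    ("la", "DET"), ("grande", "ADJ"), ("porte", "NOUN"),
]
lexical, transition = estimate_probabilities(corpus)
```

With such a tiny corpus every estimate rests on one or two observations, which is exactly the situation the following paragraphs are concerned with.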



With a large tag set and a relatively small
hand-tagged training corpus, formula (3) has an important
disadvantage: the majority of transition probabilities
cannot be estimated exactly because most of the possible
trigrams (sequences of three consecutive tags) will
not appear at all or only a few times^2.

^2 A detailed description of problems caused by small and zero
frequencies was given by Gale and Church (1989).

In our example we have a French training corpus
of 10,000 words tagged with a set of 386 different
tags, which could form $386^3$ = 57,512,456 different
trigrams, but because of the corpus size no more
than 10,000 - 2 trigrams can appear. Actually, their
number was only 4,815, i.e. 0.008 % of all possible
ones, because some of them appeared more than once
(table 1).
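The figures of this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those given in the text.

```python
tag_count = 386
corpus_size = 10_000
observed_trigram_types = 4_815

# Number of distinct tag trigrams that the tag set could form.
possible_trigrams = tag_count ** 3

# A corpus of N words contains at most N - 2 trigram occurrences.
max_trigram_tokens = corpus_size - 2

# Share of the possible trigrams that actually occur, in percent.
coverage_percent = 100 * observed_trigram_types / possible_trigrams

print(possible_trigrams, max_trigram_tokens, round(coverage_percent, 3))
```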



frequency      number and percentage
range          of trigrams in the range

16 or more        59   (1.2 %)
8 - 15           119   (2.5 %)
4 - 7            282   (5.9 %)
2 - 3            860   (18 %)
1              3,495   (73 %)
sum            4,815   (100 %)

Table 1: Trigram count from a French training
corpus of 10,000 words

When we divide, e.g., a trigram frequency of 1 by a
bigram frequency of 2 according to formula (3), we get
the probability p = 0.5, but we cannot trust it to be
exact because the frequencies it is based on are too
small.

We can take advantage of the fact that the 386 tags
are composed of only 57 different fv-pairs concerning
POS, gender, number, etc. If we consider probabilistic
relations between single fv-pairs, then we get higher
frequencies (fig. 2) and the resulting probabilities are
more exact.

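From the equations

$$\{t_i\} = \{e_{i0} \cap e_{i1} \cap \dots \cap e_{i,n-1}\} = \left\{ \bigcap_{k=0}^{n-1} e_{ik} \right\} \tag{4}$$

where $t_i$ means a tag and the $e_{ik}$ symbolize its fv-pairs,
and

$$\begin{aligned}
p(t_i \mid C_i) \cdot p(C_i) &= p\!\left( \left( \bigcap_{k=0}^{n-1} e_{ik} \right) \cap C_i \right) \\
&= p(C_i) \cdot p(e_{i0} \mid C_i) \cdot p(e_{i1} \mid C_i \cap e_{i0}) \cdots p\!\left( e_{i,n-1} \,\Big|\, C_i \cap \bigcap_{k=0}^{n-2} e_{ik} \right)
\end{aligned} \tag{5}$$

where $C_i$ means the context of $t_i$ and contains the tags
$t_{i-2}$ and $t_{i-1}$, follows

$$p(t_i \mid C_i) = p(e_{i0} \mid C_i) \cdot \prod_{k=1}^{n-1} p\!\left( e_{ik} \,\Big|\, C_i \cap \bigcap_{j=0}^{k-1} e_{ij} \right) \tag{6}$$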



The latter formula^3 describes the relation between
the contextual probability of a tag and the contextual
probabilities of its fv-pairs.
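As a worked illustration of formula (6), the contextual probability of a tag is simply the product of the contextual probabilities of its fv-pairs. The sketch below uses the three fv-pair probabilities shown in figure 2 (170/174, 96/96 and 69/465); the function name is an assumption made for this example.

```python
from math import prod


def tag_probability(fv_pair_probabilities):
    """Multiply the contextual probabilities of the fv-pairs of one tag."""
    return prod(fv_pair_probabilities)


# Probabilities p_0, p_1, p_2 of the three fv-pairs in figure 2.
p_tag = tag_probability([170 / 174, 96 / 96, 69 / 465])
print(round(p_tag, 3))  # 0.145

# A single impossible fv-pair makes the whole tag impossible.
p_impossible = tag_probability([170 / 174, 0.0, 69 / 465])
```

The result is close to the value 44/298 = 0.148 that figure 2 gives for the undecomposed tag.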

The unification of morphological features inside
a noun phrase is accomplished indirectly. In a
given context of fv-pairs the correct fv-pair obtains
the probability p = 1 and therefore will not influence
the probability of the tag to which it belongs (e.g.
$p_1$(0num:SG | ...) = 1 in fig. 2). A wrong fv-pair
would obtain p = 0 and make the whole tag impossible.

^3 suggested by Mats Rooth, IMS, Univ. Stuttgart, Germany



3 TRAINING ALGORITHM 

In the training process we are not interested in
analysing and storing the contextual probabilities
(state transition probabilities) of whole tags but of
single fv-pairs. We note them in terms of probabilistic
feature relations (PFR):

$$\mathrm{PFR}: \left( e_i \mid C_i^{r} \,;\; p(e_i \mid C_i^{r}) \right) \tag{7}$$

which later, in the tagging process, will be combined
in order to obtain the contextual tag probabilities.

The term $e_i$ in formula (7) is an fv-pair. $C_i^r$ is a
reduced context which contains only a subset of the
fv-pairs of an actually occurring context $C_i$ (fig. 1). $C_i^r$
is obtained from $C_i$ by eliminating all fv-pairs which
do not influence the relative frequency of $e_i$, according
to the condition:

$$\frac{p(e_i \mid C_i^{r})}{p(e_i \mid C_i)} \in [1 - \varepsilon,\; 1 + \varepsilon] \tag{8}$$

The considered fv-pair has nearly^4 the same
probability in the complete and in the reduced context,
i.e. $C_i$ does not supply more information about the
probability of $e_i$ than $C_i^r$ does.
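The elimination of fv-pairs under condition (8) can be sketched as follows. This is a hedged illustration, not the procedure of the training methods described later: it assumes that each trigram is represented as a set of position-indexed fv-pairs and that fv-pairs are dropped greedily, one at a time, in alphabetical order. The function names and the toy data are hypothetical.

```python
def conditional_probability(trigrams, target, context):
    """Relative frequency of `target` among the trigrams containing `context`."""
    matching = [t for t in trigrams if context <= t]
    if not matching:
        raise ValueError("context does not occur in the corpus")
    return sum(target in t for t in matching) / len(matching)


def within_tolerance(p_reduced, p_complete, epsilon):
    """Condition (8), with a guard for a zero probability."""
    if p_complete == 0:
        return p_reduced == 0
    return 1 - epsilon <= p_reduced / p_complete <= 1 + epsilon


def reduce_context(trigrams, target, complete_context, epsilon=0.03):
    """Greedily drop fv-pairs that do not change the probability of `target`."""
    p_complete = conditional_probability(trigrams, target, complete_context)
    reduced = set(complete_context)
    for pair in sorted(complete_context):
        candidate = reduced - {pair}
        p_candidate = conditional_probability(trigrams, target, candidate)
        if within_tolerance(p_candidate, p_complete, epsilon):
            reduced = candidate
    return frozenset(reduced), conditional_probability(trigrams, target, reduced)


# Toy corpus of eight trigrams, invented for illustration only.
trigrams = (
    [frozenset({"1gen:FEM", "1num:SG", "0pos:ADJ", "0gen:FEM"})] * 3
    + [frozenset({"1gen:FEM", "1num:PL", "0pos:ADJ", "0gen:FEM"})] * 2
    + [frozenset({"1gen:MAS", "1num:SG", "0pos:ADJ", "0gen:MAS"})] * 3
)
reduced, p = reduce_context(
    trigrams, "0gen:FEM", frozenset({"1gen:FEM", "1num:SG", "0pos:ADJ"})
)
print(sorted(reduced), p)
```

In this toy corpus only 1gen:FEM carries information about 0gen:FEM, so the other two fv-pairs are eliminated from the context.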



              (1a)            (1b)

t_{i-2}       2pos:DET
              2typ:DEF
              2gen:FEM
              2num:SG

t_{i-1}       1pos:NOUN
              1gen:FEM        1gen:FEM
              1num:SG

t_i           0pos:ADJ        0pos:ADJ
              0gen:FEM        0gen:FEM
              0num:SG

Figure 1: (a) Complete context $C_i$ and (b)
reduced context $C_i^r$ of the feature-value pair
$e_i$ = 0gen:FEM

In the example (fig. 1a) we consider the fv-pair
0gen:FEM. Within the given training corpus, its
probability in the complete context $C_i$, i.e. in the context
of all the other fv-pairs of figure 1a, is $p^*_0$ = 44/44 = 1
(cf. $p^*_0$ in fig. 2).

The presence of 1num:SG in tag $t_{i-1}$ does not
influence the probability of 0gen:FEM in tag $t_i$. Therefore
1num:SG can be eliminated. Only fv-pairs which
really have an influence remain in the context. The
reduced context $C_i^r$ with fewer fv-pairs, which we obtain
this way, is more general (fig. 1b).

In the given training corpus, the probability of
0gen:FEM in the context $C_i^r$ is $p_0$ = 170/174 = 0.977
(cf. $p_0$ in PFR_0 in fig. 2), which is near to $p^*_0$ = 1. The
reduced context $C_i^r$ is used to form a PFR which will
be stored.
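The example can be verified against condition (8) with the tolerance of 3 % mentioned in the footnote. The helper name below is an assumption made for this check.

```python
def satisfies_condition_8(p_reduced, p_complete, epsilon=0.03):
    """True if the ratio of the two probabilities lies in [1 - eps, 1 + eps]."""
    return 1 - epsilon <= p_reduced / p_complete <= 1 + epsilon


p_complete = 44 / 44    # probability in the complete context (fig. 1a)
p_reduced = 170 / 174   # probability in the reduced context (fig. 1b)

accepted = satisfies_condition_8(p_reduced, p_complete)
accepted_strict = satisfies_condition_8(p_reduced, p_complete, epsilon=0.02)
print(round(p_reduced, 3), accepted, accepted_strict)
```

The ratio is about 0.977, so the reduced context is accepted at a tolerance of 3 % but would be rejected at 2 %.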



^4 A small change in the probability caused by the elimination
of fv-pairs from the context is admitted if it does not exceed a
defined small percentage $\varepsilon$. (We used $\varepsilon$ = 3 %.)
We see two advantages in the use of reduced contexts
instead of complete ones:

(1) A great number of complete contexts containing
many fv-pairs can lead, after elimination of irrelevant
fv-pairs, to the same PFR, which makes the number
of all possible PFRs much smaller than the number of
all possible trigrams (cf. sec. 2).

(2) The probability of an fv-pair can be estimated
more exactly in a reduced context than in a complete
one because of the higher frequencies in the first case.

The Generation of PFRs 

In the training process we first extract from a
training corpus a set of trigrams where the tags are split
up into their fv-pairs. From these trigrams a set of
PFRs is generated separately for every fv-pair $e_i$. We
examined four different methods for this procedure:

Method 1-3: For every trigram we generate all
possible subsets of its fv-pairs. Many trigrams, e.g.
if they differ in only one fv-pair, have most of their
subsets of fv-pairs in common. The complete
trigrams and the subsets together constitute the set
of contexts and subcontexts ($C_i$ and $C_i^r$) in which an
fv-pair could appear. To generate PFRs for a given
fv-pair, we preselect and mark those (sub-)contexts
which are supposed to have an influence on the
contextual probability of the fv-pair. A (sub-)context will
not be preselected if its frequency is smaller than a
defined threshold. We use different ways for the
preselection:

Method 1: A (sub-)context will be preselected if
the considered fv-pair itself or an fv-pair belonging
to the same feature type ever appears in this
(sub-)context. E.g., if gen:MAS appears in a certain
(sub-)context then this (sub-)context will be
preselected for gen:FEM too. Furthermore, it is possible
to impose special conditions on the preselection, e.g.
that a (sub-)context can only be preselected if it
contains a POS feature in tag $t_i$ and $t_{i-1}$ (cf. fig. 1a:
0pos and 1pos).

Method 2: In order to preselect (sub-)contexts for an
fv-pair, we generate a decision tree^5 (Quinlan, 1983)
where the feature of the fv-pair, e.g. gen, num etc.,
serves to classify all existing (sub-)contexts. E.g., num
produces three classes of contexts: those containing
the fv-pair 0num:SG, those with 0num:PL and those
without a 0num feature. We assign to the tree nodes
features other than the one on which the classification is
based. The root node is labeled with the feature from
which we expect most information about the
probability of the currently considered feature. The values
of the root node feature are assigned to the branches
starting at the root node. We continue the branching
until there remain no features with an expected
information gain and a frequency higher than defined
thresholds. To every leaf of the tree corresponds a
(sub-)context which will be marked and thus
preselected for further analysis.

Method 3: For each fv-pair concerning POS we
preselect every (sub-)context containing only POS
features in tag $t_{i-2}$ and $t_{i-1}$ (classical POS trigram), e.g.
2pos:PREP 1pos:DET for 0pos:NOUN. For the other
fv-pairs we mark every (sub-)context containing any
fv-pair of the same type in the previous tag $t_{i-1}$ and
any POS features in tag $t_{i-1}$ and $t_i$, e.g. 1pos:DET
1gen:FEM 0pos:NOUN for 0gen:FEM.

With methods 1-3, we next eliminate from every
preselected (sub-)context all fv-pairs which, in the
sense described above, do not influence the relative
frequency of the currently considered fv-pair (eq. 8).
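The first step shared by methods 1-3, generating all subsets of the fv-pairs of every trigram and discarding (sub-)contexts below a frequency threshold, can be sketched as follows. The representation of a trigram as a set of position-indexed fv-pairs, the function name and the threshold value are assumptions made for this example. Note that a trigram with n fv-pairs has 2^n - 1 non-empty subsets, so this enumeration is only feasible because the tags decompose into few fv-pairs.

```python
from collections import Counter
from itertools import combinations


def subcontext_frequencies(trigrams, min_frequency=2):
    """Count every non-empty subset of fv-pairs and keep the frequent ones."""
    counts = Counter()
    for trigram in trigrams:
        pairs = sorted(trigram)
        for size in range(1, len(pairs) + 1):
            for subset in combinations(pairs, size):
                counts[frozenset(subset)] += 1
    return {ctx: f for ctx, f in counts.items() if f >= min_frequency}


# Two hypothetical trigrams that differ in a single fv-pair.
toy_trigrams = [
    frozenset({"1pos:DET", "1gen:FEM", "0pos:NOUN"}),
    frozenset({"1pos:DET", "1gen:MAS", "0pos:NOUN"}),
]
frequent = subcontext_frequencies(toy_trigrams)
print(len(frequent))  # 3 (sub-)contexts occur in both trigrams
```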

Method 4: From the set of trigrams extracted from
a training corpus we generate, separately for every
fv-pair, a binary-branched decision tree which describes
various contextual probabilities of this fv-pair.
The tree is generated by a modified version of the ID3
algorithm (Quinlan, 1983) and is similar to the one
described by Schmid (1994).

We start with a binary classification of all trigrams
based on the considered fv-pair. E.g., a classification
for gen:FEM will divide the set of trigrams into two
subsets, one where the trigrams contain 0gen:FEM in
the tag $t_i$ and one where they do not.



[Figure 3: decision tree diagram. Recoverable node labels:
gen:FEM, 1gen:FEM; recoverable node probabilities:
0.0000, 0.7727, 0.9693.]

Figure 3: Decision tree for the fv-pair
0gen:FEM (Every number is a probability of
0gen:FEM in the context described by the
path from the root node to the node labeled
with the number.)

^5 suggested by Helmut Schmid, IMS, Univ. Stuttgart,
Germany. For reasons of space we explain only how we employ
decision trees for our purposes. For details about the automatic
generation of such trees see Quinlan (1983).

p(0gen:FEM 0num:SG 0pos:ADJ | 1gen:FEM 1num:SG 1pos:NOUN
      2gen:FEM 2num:SG 2pos:DET 2typ:DEF) = 44/298 = 0.148

p*_0(0gen:FEM | 0num:SG 0pos:ADJ 1gen:FEM 1num:SG 1pos:NOUN
      2gen:FEM 2num:SG 2pos:DET 2typ:DEF) = 44/44 = 1.0
PFR_0 : (0gen:FEM | 0pos:ADJ 1gen:FEM ; p_0 = 170/174 = 0.977)

p*_1(0num:SG | 0gen:FEM 0pos:ADJ 1gen:FEM 1num:SG 1pos:NOUN
      2gen:FEM 2num:SG 2pos:DET 2typ:DEF) = 44/44 = 1.0
PFR_1 : (0num:SG | 0pos:ADJ 1num:SG 2pos:DET ; p_1 = 96/96 = 1.0)

p*_2(0pos:ADJ | 1gen:FEM 1num:SG 1pos:NOUN
      2gen:FEM 2num:SG 2pos:DET 2typ:DEF) = 44/298 = 0.148
PFR_2 : (0pos:ADJ | 1gen:FEM 1pos:NOUN 2pos:DET ; p_2 = 69/465 = 0.148)

$$\prod_{s=0}^{2} p_s = 0.145$$

The position index at the beginning of every feature-value pair indicates the tag to which
it belongs; e.g. 0gen:FEM belongs to tag $t_i$ and 2num:SG to $t_{i-2}$.

Figure 2: Decomposition and reconstruction of a contextual tag probability (state
transition probability) using probabilistic feature relations (PFR)

The tree is built up recursively (fig. 3). At each
step, i.e. with the construction of each node, we test
which one of the other fv-pairs delivers most
information concerning the above-described classification.
The current node will be labeled with this fv-pair. One
of its two branches concerns the trigrams which
contain the fv-pair, the other branch concerns the
trigrams which do not contain it. The recursive
expansion of the tree stops if either the information gained
by consulting further fv-pairs or the frequencies upon
which the calculation is based are smaller than defined
thresholds.
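The choice of the fv-pair that labels a node can be illustrated with the standard information gain criterion of ID3. The text above only states that a modified version of ID3 is used, so the plain entropy-based gain below is an assumption, as are the function names and the toy corpus.

```python
from math import log2


def entropy(positive, total):
    """Binary entropy of a set with `positive` members of the target class."""
    if total == 0 or positive in (0, total):
        return 0.0
    p = positive / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(trigrams, target, test_pair):
    """Gain obtained by splitting on the presence of `test_pair`."""
    def h(subset):
        return entropy(sum(target in t for t in subset), len(subset))

    with_pair = [t for t in trigrams if test_pair in t]
    without_pair = [t for t in trigrams if test_pair not in t]
    n = len(trigrams)
    remainder = len(with_pair) / n * h(with_pair)
    remainder += len(without_pair) / n * h(without_pair)
    return h(trigrams) - remainder


def best_split(trigrams, target, candidates):
    """Return the candidate fv-pair with the highest information gain."""
    return max(candidates, key=lambda c: information_gain(trigrams, target, c))


# Toy corpus of eight trigrams, invented for illustration only.
trigrams = (
    [frozenset({"1gen:FEM", "1num:SG", "0pos:ADJ", "0gen:FEM"})] * 3
    + [frozenset({"1gen:FEM", "1num:PL", "0pos:ADJ", "0gen:FEM"})] * 2
    + [frozenset({"1gen:MAS", "1num:SG", "0pos:ADJ", "0gen:MAS"})] * 3
)
candidates = ["1gen:FEM", "1num:SG", "0pos:ADJ"]
root_label = best_split(trigrams, "0gen:FEM", candidates)
print(root_label)  # 1gen:FEM
```

A full tree builder would apply `best_split` recursively to the two subsets and stop when the gain or the subset size falls below the chosen thresholds.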

4 TAGGING ALGORITHM 

The starting point for the implementation of a feature
structure tagger was a second-order HMM tagger
(trigrams) based on a modified version of the Viterbi
algorithm (Viterbi, 1967; Church, 1988), which we had
earlier implemented in C (Kempe, 1994). There we
replaced the function which estimated the contextual
probability of a tag (state transition probability) by
dividing a trigram frequency by a bigram frequency
(eq. 3) with a function which accomplishes this
calculation either using PFRs in the way described above
(eqs. 6, 7) or by consulting a decision tree (fig. 3).

To estimate the contextual probability of a tag we 
have to know the contextual probabilities of its fv- 
pairs in order to multiply them (eq. 6). 

Using PFRs generated by method 1 or 2, when
e.g. looking for the probability $p^*_2$(0pos:ADJ | ...) from
figure 2, we may find in the list of PFRs, instead of
a PFR which would directly correspond (but is not
stored), the two PFRs

(0pos:ADJ | 1gen:FEM 1pos:NOUN 2pos:DET;
$p_1$ = 0.148)

(0pos:ADJ | 0num:SG 1num:SG 1syn:NOUN 2syn:DET;
$p_2$ = 0.414)

Both of them contain subsets of the fv-pairs of the
required complete context and could therefore both be
applied. In such a case we need to know how to combine
$p_1$ and $p_2$ in order to get $p$ (= $p^*_2$ in fig. 2).

As there exists no mathematical relation between
these three probabilities, we simply average $p_1$ and $p_2$
to get $p$, because this gives tagging results as good as
those of a number of more complicated approaches which
we examined.
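The lookup and averaging step can be sketched as follows. The storage format of the PFRs (a list of triples) and the function name are assumptions, and the complete context in the example is constructed so that both PFRs quoted above apply; it is illustrative and not taken from the corpus.

```python
def contextual_probability(pfrs, fv_pair, complete_context):
    """Average the probabilities of all stored PFRs that apply.

    pfrs: list of (fv_pair, reduced_context, probability) triples.
    A PFR applies if its reduced context is a subset of the complete context.
    """
    applicable = [
        p for pair, reduced_context, p in pfrs
        if pair == fv_pair and reduced_context <= complete_context
    ]
    if not applicable:
        return None
    return sum(applicable) / len(applicable)


pfrs = [
    ("0pos:ADJ", frozenset({"1gen:FEM", "1pos:NOUN", "2pos:DET"}), 0.148),
    ("0pos:ADJ", frozenset({"0num:SG", "1num:SG", "1syn:NOUN", "2syn:DET"}), 0.414),
]
# Illustrative complete context containing the fv-pairs of both PFRs.
context = frozenset({
    "0num:SG", "1gen:FEM", "1num:SG", "1pos:NOUN", "1syn:NOUN",
    "2pos:DET", "2syn:DET",
})
p = contextual_probability(pfrs, "0pos:ADJ", context)
print(p)  # approximately (0.148 + 0.414) / 2 = 0.281
```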

PFRs generated by method 3 do not create this 
problem. For every complete context only one PFR is 
stored. 

When we use the set of decision trees generated by
method 4, we obtain for every fv-pair in every
possible context only one probability by descending
the relevant branches until a probability
is reached.

In contrast to the PFRs of the other methods, the
decision trees also contain negative information about
the context of an fv-pair, i.e. not only which fv-pairs
have to be in the context but also which ones must be
absent.
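Looking up a probability in such a binary tree amounts to following, at each node, the branch for "fv-pair present" or "fv-pair absent". The sketch below assumes a simple node structure; the class, the function name and the small tree with its probabilities are hypothetical and do not reproduce figure 3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    probability: float                 # probability of the fv-pair at this node
    test_pair: Optional[str] = None    # fv-pair tested here; None for a leaf
    present: Optional["Node"] = None   # branch taken if test_pair is in the context
    absent: Optional["Node"] = None    # branch taken if test_pair is missing


def lookup(node, context):
    """Descend the tree according to the context and return the leaf probability."""
    while node.test_pair is not None:
        node = node.present if node.test_pair in context else node.absent
    return node.probability


# Hypothetical tree for one fv-pair; all numbers are invented.
tree = Node(
    probability=0.4,
    test_pair="1gen:FEM",
    present=Node(
        probability=0.9,
        test_pair="0pos:ADJ",
        present=Node(probability=0.97),
        absent=Node(probability=0.77),
    ),
    absent=Node(probability=0.0),
)

print(lookup(tree, {"1gen:FEM", "0pos:ADJ"}))  # 0.97
```

The `absent` branch is what carries the negative information: a context lacking 1gen:FEM leads to a different probability than one containing it.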

5 TAGGING RESULTS 

In the training and tagging process we experimented
with different values for parameters such as the minimal
admitted frequency for preselection, the admitted percentage
difference $\varepsilon$ between probabilities considered to be
equal, etc. (cf. sec. 3).

The feature structure tagger was trained on the
French 10,000-word corpus already mentioned in
table 1, with the four different training methods (sec. 3).
When tagging a 6,000-word corpus^6 with an average
ambiguity of 2.63 tags per word (after the dictionary
look-up) we obtained in the best case an accuracy of
88.89 % (table 2).

^6 No overlap between training and test corpora.



tagger   training corpus         tag set           HMM     tagging
         number      language    tags   fv-prs.    order   accuracy
         of words

tT       2,000,000   English      47    -          1       94.93 %
tT       2,000,000   English      47    -          2       96.16 %
tT          10,000   French      386    57         1       56.39 %
tT          10,000   French      386    57         2       83.23 %
lpT         10,000   French      386    57         2       83.81 %
fsT1        10,000   French      386    57                 88.53 %
fsT2        10,000   French      386    57                 88.89 %
fsT3        10,000   French      386    57                 88.44 %
fsT4        10,000   French      386    57                 88.14 %

tT          "traditional" HMM tagger,
lpT         "tagger" considering only lexical probabilities,
fsT1..4     feature structure tagger
            trained with method 1..4,
HMM order   1 = bigrams, 2 = trigrams

Table 2: Comparison of the tagging accuracy with
different taggers, corpora, tag sets and HMM orders

For comparison, we used a "traditional" HMM
tagger (cf. sec. 4) on the same training and test
corpora and got an accuracy of 83.23 %^7, i.e. the
error rate was about 50 % higher than with the
feature structure tagger (table 2).
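The statement about the error rate follows from the accuracies in table 2 by simple arithmetic, taking the best feature structure tagger result (88.89 %). The variable and function names are illustrative.

```python
def error_rate(accuracy_percent):
    """Error rate in percent for a given tagging accuracy in percent."""
    return 100.0 - accuracy_percent


traditional_error = error_rate(83.23)        # about 16.77 %
feature_structure_error = error_rate(88.89)  # about 11.11 %

# Relative increase of the error rate of the "traditional" tagger, in percent.
relative_increase = (traditional_error / feature_structure_error - 1) * 100
print(round(relative_increase, 1))
```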

When we used a tool which always selects the lexi- 
cally most probable tag without considering the con- 
text we obtained an accuracy of 83.81 %, which is even 
better than with the "traditional" HMM-tagger. 

Provided with enough training data and working
on a small tag set, our "traditional" tagger got an
accuracy of 96.16 % (Kempe, 1994), which is usual in
this case (Cutting et al., 1992). The English test corpus
we used here had an average ambiguity of 2.61 tags per
word, which is remarkably similar to the ambiguity of
the French corpus.

The feature structure tagger is clearly better when 
the available training corpus is small and the tag set 
large but the tags are decomposable into few fv-pairs. 

6 FURTHER RESEARCH 

We intend to search for other similar models while 
keeping in mind the basic idea described above: Split- 
ting up a tag into fv-pairs and deducing its contextual 
probability from the contextual probabilities of its fv- 
pairs. 

Furthermore, it may be preferable to split up the
tags only when the frequencies are too small^8.



^7 For a similar experiment for German (20,000 words training
corpus, 689 tags, trigrams) an accuracy of 72.5 % has been
reported (Wothke et al., 1993, p. 21).

^8 suggested by Ted Briscoe, Rank Xerox Research Centre,
Grenoble, France